pycoQC Usage Notebook

Example files

The pycoQC repository contains 6 example sequencing summary files generated with various versions of Albacore. Each file contains only 10,000 reads.

  • ./docs/data/Albacore-1.2.1_basecall-1D-DNA_small_sequencing_summary.txt.gz
  • ./docs/data/Albacore-1.2.3_basecall-1D-RNA_small_sequencing_summary.txt.gz
  • ./docs/data/Albacore-1.7.0_basecall-1D-DNA_small_sequencing_summary.txt.gz
  • ./docs/data/Albacore-2.1.10_basecall-1D-DNA_small_sequencing_summary.txt.gz
  • ./docs/data/Albacore-2.1.10_basecall-1D-RNA_small_sequencing_summary.txt.gz
  • ./docs/data/Albacore-2.3.1_basecall-1D-RNA_small_sequencing_summary.txt.gz

Using pycoQC

General information

pycoQC is a simple class that is initialized with a text summary file generated by ONT Albacore. For a 1D run, use the file named sequencing_summary.txt available at the root of the Albacore output directory. For 1D2, use sequencing_1dsq_summary.txt, which can be found in the 1dsq_analysis directory.

The instantiated object can subsequently be called with various methods that generate tables and plots.

There are a few different ways to get help for all the public package functions:

  • In a separate window with the jupyter magic "?": ?pycoQC.channels_activity
  • In an output cell with the standard help function: help (pycoQC.channels_activity)
  • Inline, with the cursor on the function of interest, using Shift + Tab

Imports

For plotly offline plotting

Import pycoQC main class as well as Plotly and enable inline plotting in the current Notebook.

This is the recommended option. It ensures that all your data are stored inside the notebook.

The limitation is that generating many plots from large datasets makes the notebook quite heavy and slow.

In [41]:
# Run cell with Ctrl + Enter
from pycoQC.pycoQC import pycoQC
from plotly.offline import plot, iplot, init_notebook_mode
init_notebook_mode (connected=False)

For plotly online plotting

This option takes advantage of the Plotly web-service for hosting graphs. It requires setting up an account (https://plot.ly/python/getting-started/#initialization-for-online-plotting) and providing credentials in the notebook. This could be a good option for easy sharing of the interactive plots generated by pycoQC.

In [ ]:
# Only run this cell if you have already set up a plotly account and want to use the Plotly web-service
from plotly.plotly import plot, iplot
import plotly.tools as pt
pt.set_credentials_file (username="XXXXXXXXXX", api_key="XXXXXXXXXX")

Initialisation

Upon initialization pycoQC reads the sequencing summary file, runs a series of tests and pre-processes the data for the plotting methods.

Sequencing_summary file

pycoQC can read compressed sequencing_summary.txt files (‘gzip’, ‘bz2’, ‘zip’, ‘xz’). Instead of a single file, it is also possible to pass a UNIX-style glob pattern (called a regex in the docstring) to match multiple files.

Depending on the run type and the version of Albacore used, some information might not be available. In particular, calibration reads were not flagged in early versions of Albacore. When the field is available, those reads are automatically discarded. Similarly, barcode information is only available for multiplexed runs.
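
To preview which files a pattern would select before handing it to pycoQC, the standard glob module can be used on its own. The sketch below is only an illustration: it creates placeholder files in a temporary directory (their names mimic the example data, their content is not real Albacore output) and matches them with the same kind of pattern used later in this notebook.

```python
import glob
import gzip
import os
import tempfile

# Throwaway directory with hypothetical file names that mimic the example
# data; the contents are placeholders, not real Albacore output.
tmp_dir = tempfile.mkdtemp()
names = [
    "Albacore-1.2.3_basecall-1D-RNA_small_sequencing_summary.txt.gz",
    "Albacore-1.7.0_basecall-1D-DNA_small_sequencing_summary.txt.gz",
    "Albacore-2.1.10_basecall-1D-RNA_small_sequencing_summary.txt.gz",
]
for name in names:
    with gzip.open(os.path.join(tmp_dir, name), "wt") as handle:
        handle.write("read_id\trun_id\n")

# "*" matches any run of characters, so "*RNA*" keeps only the RNA files.
matched = sorted(glob.glob(os.path.join(tmp_dir, "*RNA*")))
matched_names = [os.path.basename(path) for path in matched]
print(matched_names)
```

If the list printed here is not the one you expect, adjust the pattern before passing it as seq_summary_file.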

Run type

The type of run (1D or 1D2) is automatically detected, but it can be explicitly enforced with run_type if needed.

Run ID reordering

There are often several run IDs in a single sequencing_summary file. Unfortunately there is no way to know the correct order based on the information contained in the sequencing_summary.txt file alone. By default pycoQC automatically reorders the runs by decreasing throughput, which should normally reflect the sequencing order. However, if you know the order you can specify it at initialisation with the option runid_list. This option can also be used to select specific run IDs.
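
The idea behind the reordering can be sketched in a few lines of plain Python. This is not pycoQC's internal code: the read records are made up, and defining throughput as total bases divided by the last start time of the run is an assumption made for the illustration. Once the runs are ordered, each run is shifted by a time offset so that it starts where the previous one ended.

```python
# Hypothetical per-read records: (run_id, start_time in seconds, read length).
reads = [
    ("run_b", 0.0, 500),
    ("run_b", 100.0, 700),
    ("run_a", 0.0, 2000),
    ("run_a", 50.0, 3000),
    ("run_c", 0.0, 100),
    ("run_c", 200.0, 100),
]

def order_runs_by_throughput(reads):
    """Return run IDs sorted by decreasing bases per second, with durations."""
    stats = {}
    for run_id, start_time, length in reads:
        bases, last_start = stats.get(run_id, (0, 0.0))
        stats[run_id] = (bases + length, max(last_start, start_time))
    # Assumption: throughput = bases / last start time. The max(..., 1.0)
    # guards against a zero duration when a run holds a single read.
    throughput = {
        run_id: bases / max(last_start, 1.0)
        for run_id, (bases, last_start) in stats.items()
    }
    ordered = sorted(throughput, key=throughput.get, reverse=True)
    durations = {run_id: stats[run_id][1] for run_id in ordered}
    return ordered, durations

def time_offsets(ordered, durations):
    """Shift each run so that it starts where the previous one ended."""
    offsets, elapsed = {}, 0.0
    for run_id in ordered:
        offsets[run_id] = elapsed
        elapsed += durations[run_id]
    return offsets

ordered, durations = order_runs_by_throughput(reads)
offsets = time_offsets(ordered, durations)
print(ordered)
print(offsets)
```

When the true order is known, passing it explicitly through runid_list avoids relying on this kind of heuristic.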

Minimal "pass" quality

By default pycoQC assumes that the minimal mean quality for a "pass" read is 7 (same as default Albacore value). However if you want to adjust the value, you can specify it at initialisation with min_pass_qual.
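
The effect of the threshold on the "pass" count is easy to picture with a handful of made-up quality scores. The sketch below is only an illustration; in particular, treating the threshold as inclusive (>=) is an assumption, as the notebook does not state it.

```python
# Hypothetical mean quality scores, one per read.
mean_qscores = [4.2, 6.9, 7.0, 9.5, 11.3, 5.8]

def count_pass_reads(qscores, min_pass_qual=7):
    """Count reads whose mean quality reaches the pass threshold."""
    # Assumption: the threshold is inclusive (>=).
    return sum(1 for q in qscores if q >= min_pass_qual)

default_pass = count_pass_reads(mean_qscores)      # threshold of 7
strict_pass = count_pass_reads(mean_qscores, 9)    # stricter threshold
print(default_pass, strict_pass)
```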

In [42]:
help (pycoQC.__init__)
Help on function __init__ in module pycoQC.pycoQC:

__init__(self, seq_summary_file, run_type=None, runid_list=[], min_pass_qual=7, verbose_level=1)
    Parse Albacore sequencing_summary.txt file and clean-up the data
    * seq_summary_file: STR
        Path to the sequencing_summary generated by Albacore 1.0.0 +. One can also pass a UNIX style regex to match
        multiple files with glob https://docs.python.org/3.6/library/glob.html
    * run_type: STR [Default None = autodetect]
        Force to us the Type of the run 1D or 1D2
    * runid_list: LIST of STR [Default []]
        Select only specific runids to be analysed. Can also be used to force pycoQC to order the runids for
        temporal plots, if the sequencing_summary file contain several sucessive runs. By default pycoQC analyses
        all the runids in the file and uses the runid order as defined in the file.
    * min_pass_qual INT [Default 7]
        Minimum quality to consider a read as 'pass'
    * verbose_level INT [Default 1]
        Level of verbosity, from 3 (Chatty) to 1 (Nothing)

Basic initialisation

In [13]:
# Run cell with Ctrl + Enter
p = pycoQC("./data/Albacore-1.7.0_basecall-1D-DNA_small_sequencing_summary.txt.gz")
print (p)
[pycoQC]
	Total reads: 6,957
	Pass reads: 5,594
	Minimal Pass Quality: 7
	Run Duration: 44.78 h
	Total Bases: 5,352,945
	Barcode found: True

Initialisation with summary file regex and maximum verbose level

In [14]:
p = pycoQC("./data/*RNA*", verbose_level=3)
Import raw data from sequencing summary files
	Sequencing summary files found: ['./data/Albacore-1.2.3_basecall-1D-RNA_small_sequencing_summary.txt.gz', './data/Albacore-2.1.10_basecall-1D-RNA_small_sequencing_summary.txt.gz', './data/Albacore-2.3.1_basecall-1D-RNA_small_sequencing_summary.txt.gz']
	30,000 reads found in initial file
Verify fields and discard unused columns
	1D Run type
	Columns found: ['read_id', 'run_id', 'channel', 'start_time', 'sequence_length_template', 'mean_qscore_template', 'calibration_strand_genome_template']
Drop lines containing NA values
	0 reads discarded
Filter out calibration strand reads
	3,207 reads discarded
Filter out zero length reads
	813 reads discarded
Sort run IDs by decreasing throughput
	Run-id order ['7ae4f0a6d2b7ba3e0248496b7de9cd5d1c028415', '5074e0cd71f372314c30ca5158aab2172d915023', 'c675730269f2f96f300f1cfa613fe89c53b344c3', '2b9163100702bba6ac29d37dbc96ccad740aa05d', 'd0054681152930b21276405d948b115e46968ca6', '71055637dd56eca9416305332eba1ed37bbfffe1', '9835d20f1d205bdbd1fb4d464ae778de95beab24', 'db5916f2fe7957afac1d0aaccdec883342c4bc31', '93fa1ad3ebc8a6e505d991bcb052c2b8ceb278b5', '17b317b994031430f350cda1dc13a72f66572ece']
	Reorder runids
	Processing reads with Run_ID 7ae4f0a6d2b7ba3e0248496b7de9cd5d1c028415 / time offset: 0
	Processing reads with Run_ID 5074e0cd71f372314c30ca5158aab2172d915023 / time offset: 5309.74734
	Processing reads with Run_ID c675730269f2f96f300f1cfa613fe89c53b344c3 / time offset: 15911.26726
	Processing reads with Run_ID 2b9163100702bba6ac29d37dbc96ccad740aa05d / time offset: 16306.31175
	Processing reads with Run_ID d0054681152930b21276405d948b115e46968ca6 / time offset: 16694.80113
	Processing reads with Run_ID 71055637dd56eca9416305332eba1ed37bbfffe1 / time offset: 17090.52523
	Processing reads with Run_ID 9835d20f1d205bdbd1fb4d464ae778de95beab24 / time offset: 61285.92364
	Processing reads with Run_ID db5916f2fe7957afac1d0aaccdec883342c4bc31 / time offset: 220697.58266999997
	Processing reads with Run_ID 93fa1ad3ebc8a6e505d991bcb052c2b8ceb278b5 / time offset: 393486.23117
	Processing reads with Run_ID 17b317b994031430f350cda1dc13a72f66572ece / time offset: 393850.01437
Reindex dataframe by read_ids
[pycoQC]
	Total reads: 25,980
	Pass reads: 20,453
	Minimal Pass Quality: 7
	Run Duration: 143.9 h
	Total Bases: 22,962,723
	Barcode found: False

Generating plots and tables

Interaction with Plotly library

All plots are generated with plotly for Python. Each plotting method returns a plotly Figure object that can be used for:

  • Further customization using the numerous methods attached to the Figure object
  • Inline plotting in Jupyter Notebook using iplot (either from plotly.plotly or plotly.offline)
  • Generating a separate HTML file with plot (either from plotly.plotly or plotly.offline)
  • Exporting as a static image (https://plot.ly/python/static-image-export/), pdf (https://plot.ly/python/pdf-reports/) or various text formats.

In this notebook we will use the inline plotting option with the offline plotly library.

Users can also customize the figures online in a user-friendly environment by clicking on "Edit in Chart Studio" in the upper right corner of each figure.

Similarly static pictures can be exported using the "Download plot as a png" button.

Common arguments

All the methods have the arguments width and height, which can be used to customize the plotting area. In general we do not recommend modifying these values as they might disrupt the plot layout.

Most of the methods also have the argument sample. By default pycoQC downsamples the number of reads to 100,000 before plotting. This drastically reduces the processing time for large datasets and has a very limited impact on the appearance of the plots. The sampling is random but deterministic, meaning that you should always obtain the same results for the same dataset. The value can be changed to increase or decrease the number of reads. Alternatively, one can deactivate the behavior by specifying sample=False.
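
"Random but deterministic" usually means that the random generator is seeded with a fixed value. The sketch below shows the principle with the standard library only; the function name, the seed value and the use of random.Random are assumptions made for the illustration, not a description of pycoQC's code.

```python
import random

def downsample(read_ids, sample=100000, seed=42):
    """Return at most `sample` reads, identical on every call."""
    # sample=False (or None) switches the downsampling off.
    if not sample or len(read_ids) <= sample:
        return list(read_ids)
    # A dedicated generator with a fixed seed keeps the draw reproducible
    # without touching the global random state.
    return random.Random(seed).sample(list(read_ids), sample)

read_ids = [f"read_{i}" for i in range(1000)]
first = downsample(read_ids, sample=100)
second = downsample(read_ids, sample=100)
everything = downsample(read_ids, sample=False)
print(len(first), first == second, len(everything))
```

Because the seed is fixed, two calls on the same dataset select exactly the same reads, which is why repeated plots look identical.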

Overall data summary

The summary method generates a simple summary table with a clickable button to switch from "all reads" to "pass reads" only.

In [43]:
help(pycoQC.summary)
Help on function summary in module pycoQC.pycoQC:

summary(self, width=None, height=None, plot_title='Run summary')
    Plot an interactive summary table
    * width: With of the ploting area in pixel
    * height: height of the ploting area in pixel

In [16]:
# Run cell with Ctrl + Enter
p = pycoQC("./data/*RNA_small_sequencing_summary.txt.gz")
fig = p.summary()
iplot (fig, show_link=False)

Read Length and Mean quality distribution

pycoQC has 3 methods to visualize the distribution of mean quality scores and of estimated read length:

  • reads_len_1D: A histogram of estimated read length in logarithmic scale
  • reads_qual_1D: A histogram of mean quality scores
  • reads_len_qual_2D: A density contour plot of estimated read length vs mean quality scores in semilog scale

Although we recommend sticking to the default values, all 3 methods allow users to customize the plots.

  • The number of bins used to divide the read quality and/or length space can be specified with nbins for the 1D plots and len_nbins / qual_nbins for the 2D plot
  • The intensity of line smoothing (using a Gaussian kernel filter) can be specified with smooth_sigma
  • Additional cosmetic customizations are available: color/colorscale
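
To give a feel for what the number of bins and the smoothing intensity control, here is a minimal, dependency-free sketch of a log-scale histogram followed by Gaussian smoothing. It is not the code used by pycoQC: the function names are hypothetical, sigma is expressed in number of bins (an assumption, and it must be greater than 0), and the read lengths are made up.

```python
import math

def log_histogram(lengths, nbins=10):
    """Count reads in bins evenly spaced on a log10 scale."""
    low, high = math.log10(min(lengths)), math.log10(max(lengths))
    width = (high - low) / nbins
    counts = [0] * nbins
    for length in lengths:
        if width == 0:
            index = 0  # all reads have the same length
        else:
            # The longest read would fall just past the last bin: clamp it.
            index = min(int((math.log10(length) - low) / width), nbins - 1)
        counts[index] += 1
    return counts

def gaussian_smooth(values, sigma=2):
    """Smooth a list of counts with a normalised Gaussian kernel."""
    radius = int(3 * sigma)
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in offsets]
    total = sum(kernel)
    smoothed = []
    for i in range(len(values)):
        acc = 0.0
        for k, weight in zip(offsets, kernel):
            # Clamp indices at the edges so the curve does not drop to zero.
            j = min(max(i + k, 0), len(values) - 1)
            acc += values[j] * weight
        smoothed.append(acc / total)
    return smoothed

lengths = [100, 150, 200, 1000, 1200, 5000, 10000]  # made-up read lengths
counts = log_histogram(lengths, nbins=10)
smoothed = gaussian_smooth(counts, sigma=2)
print(counts)
print([round(value, 2) for value in smoothed])
```

More bins give a finer but noisier curve, while a larger sigma flattens local peaks; this is the trade-off behind the default values.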
In [44]:
help(pycoQC.reads_len_1D)
Help on function reads_len_1D in module pycoQC.pycoQC:

reads_len_1D(self, color='lightsteelblue', width=None, height=500, nbins=200, smooth_sigma=2, sample=100000, plot_title='Distribution of read length')
    Plot a distribution of read length (log scale)
    * color: Color of the area (hex, rgb, rgba, hsl, hsv or any CSV named colors https://www.w3.org/TR/css-color-3/#svg-color
    * width: With of the ploting area in pixel
    * height: height of the ploting area in pixel
    * nbins: Number of bins to devide the x axis in
    * smooth_sigma: standard deviation for Gaussian kernel
    * sample: If given, a n number of reads will be randomly selected instead of the entire dataset

In [18]:
# Run cell with Ctrl + Enter
p = pycoQC("./data/Albacore-2.1.10_basecall-1D-RNA_small_sequencing_summary.txt.gz")
fig = p.reads_len_1D()
iplot(fig, show_link=False)
In [45]:
help(pycoQC.reads_qual_1D)
Help on function reads_qual_1D in module pycoQC.pycoQC:

reads_qual_1D(self, color='salmon', width=None, height=500, nbins=200, smooth_sigma=2, sample=100000, plot_title='Distribution of read quality scores')
    Plot a distribution of quality scores
    * color: Color of the area (hex, rgb, rgba, hsl, hsv or any CSV named colors https://www.w3.org/TR/css-color-3/#svg-color
    * width: With of the ploting area in pixel
    * height: height of the ploting area in pixel
    * nbins: Number of bins to devide the x axis in
    * smooth_sigma: standard deviation for Gaussian kernel
    * sample: If given, a n number of reads will be randomly selected instead of the entire dataset

In [20]:
# Run cell with Ctrl + Enter
p = pycoQC("./data/Albacore-2.1.10_basecall-1D-RNA_small_sequencing_summary.txt.gz")
fig = p.reads_qual_1D()
iplot(fig, show_link=False)
In [46]:
help(pycoQC.reads_len_qual_2D)
Help on function reads_len_qual_2D in module pycoQC.pycoQC:

reads_len_qual_2D(self, colorscale=[[0.0, 'rgba(255,255,255,0)'], [0.1, 'rgba(255,150,0,0)'], [0.25, 'rgb(255,100,0)'], [0.5, 'rgb(200,0,0)'], [0.75, 'rgb(120,0,0)'], [1.0, 'rgb(70,0,0)']], width=None, height=600, len_nbins=200, qual_nbins=75, smooth_sigma=2, sample=100000, plot_title='Mean read quality per sequence length')
    Plot a 2D distribution of quality scores vs length of the reads
    * colorscale: a valid plotly color scale https://plot.ly/python/colorscales/ (Not recommanded to change)
    * width: With of the ploting area in pixel
    * height: height of the ploting area in pixel
    * len_nbins: Number of bins to divide the read length values in (x axis)
    * qual_nbins: Number of bins to divide the read quality values in (y axis)
    * smooth_sigma: standard deviation for 2D Gaussian kernel
    * sample: If given, a n number of reads will be randomly selected instead of the entire dataset

In [22]:
# Run cell with Ctrl + Enter
p = pycoQC("./data/*RNA*")
fig = p.reads_len_qual_2D ()
iplot(fig, show_link=False)

Sequencing output and quality over experiment time

pycoQC can generate plots showing the evolution of the sequencing output (output_over_time) as well as the mean read quality (qual_over_time) over the course of the sequencing run.

Please be aware that if there are multiple run IDs in the source file(s), pycoQC reorders the run IDs by decreasing throughput/second as explained in Initialisation. This means that the over_time plots could be wrong, particularly when mixing several runs together.

For qual_over_time, the argument smooth_sigma can be used to modulate the smoothing factor of the Gaussian filter if you are not satisfied with the default result. According to its help text, output_over_time does not take this argument.

The colors of both plots can be fully customised:

  • cumulative_color and interval_color for output_over_time
  • median_color, quartile_color and extreme_color for qual_over_time
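
The median, quartile and extreme bands of the quality plot correspond to summary statistics computed per time window. The sketch below illustrates that idea with the standard statistics module; the one hour window, the function name and the read records are assumptions made for the example, and pycoQC's actual binning may differ.

```python
import statistics

def quality_per_window(reads, window=3600):
    """Summarise (start_time, mean_qscore) pairs per time window."""
    windows = {}
    for start_time, qscore in reads:
        windows.setdefault(int(start_time // window), []).append(qscore)
    summary = {}
    for index in sorted(windows):
        scores = sorted(windows[index])
        # statistics.quantiles needs at least two values on older Pythons.
        if len(scores) >= 2:
            q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
        else:
            q1 = q3 = scores[0]
        summary[index] = {
            "min": scores[0],
            "q1": q1,
            "median": statistics.median(scores),
            "q3": q3,
            "max": scores[-1],
        }
    return summary

# Hypothetical reads: (start time in seconds, mean quality score).
reads = [(10, 8.0), (600, 9.0), (1800, 10.0), (3500, 11.0), (4000, 6.5)]
summary = quality_per_window(reads)
for index, stats in summary.items():
    print(index, stats)
```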
In [47]:
help(pycoQC.output_over_time)
Help on function output_over_time in module pycoQC.pycoQC:

output_over_time(self, cumulative_color='rgb(204,226,255)', interval_color='rgb(102,168,255)', width=None, height=500, sample=100000, plot_title='Output over experiment time')
    Plot a yield over time
    * cumulative_color: Color of cumulative yield area (hex, rgb, rgba, hsl, hsv or any CSV named colors https://www.w3.org/TR/css-color-3/#svg-color
    * interval_color: Color of interval yield line (hex, rgb, rgba, hsl, hsv or any CSV named colors https://www.w3.org/TR/css-color-3/#svg-color
    * width: With of the ploting area in pixel
    * height: height of the ploting area in pixel
    * sample: If given, a n number of reads will be randomly selected instead of the entire dataset

In [24]:
# Run cell with Ctrl + Enter
p  = pycoQC ("./data/Albacore-1.2.1_basecall-1D-DNA_small_sequencing_summary.txt.gz")
fig = p.output_over_time ()
iplot(fig, show_link=False)
In [48]:
help (pycoQC.qual_over_time)
Help on function qual_over_time in module pycoQC.pycoQC:

qual_over_time(self, median_color='rgb(102,168,255)', quartile_color='rgb(153,197,255)', extreme_color='rgba(153,197,255, 0.5)', smooth_sigma=1, width=None, height=500, sample=100000, plot_title='Mean read quality over experiment time')
    Plot a mean quality over time
    * median_color: Color of median line color (hex, rgb, rgba, hsl, hsv or any CSV named colors https://www.w3.org/TR/css-color-3/#svg-color
    * quartile_color: Color of inter quartile area and lines (hex, rgb, rgba, hsl, hsv or any CSV named colors https://www.w3.org/TR/css-color-3/#svg-color
    * extreme_color:: Color of inter extreme area and lines (hex, rgb, rgba, hsl, hsv or any CSV named colors https://www.w3.org/TR/css-color-3/#svg-col
    * smooth_sigma: sigma parameter for the Gaussian filter line smoothing
    * width: With of the ploting area in pixel
    * height: height of the ploting area in pixel
    * sample: If given, a n number of reads will be randomly selected instead of the entire dataset

In [26]:
# Run cell with Ctrl + Enter
p  = pycoQC ("./data/Albacore-2.1.10_basecall-1D-DNA_small_sequencing_summary.txt.gz")
fig = p.qual_over_time ()
iplot(fig, show_link=False)

Barcode distribution

When barcoding information is available, it is possible to generate a pie chart of the barcode count distribution. If no barcode information is available pycoQC throws an error.

It is not rare to have non-relevant barcodes detected at very low levels. By default any barcode below 0.1% of the reads is excluded from the plot, but this can be changed with min_percent_barcode.

As with the previously described methods, the colors are customisable with the colors argument.
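
The filtering step can be pictured as a simple percentage cut-off on the barcode counts. The sketch below is an illustration with made-up barcode calls and a hypothetical function name, not pycoQC's implementation.

```python
from collections import Counter

def filter_barcodes(barcodes, min_percent_barcode=0.1):
    """Return {barcode: percent of reads}, dropping rare barcodes."""
    counts = Counter(barcodes)
    total = sum(counts.values())
    percents = {name: 100 * n / total for name, n in counts.items()}
    # Barcodes strictly below the threshold are excluded.
    return {
        name: pct for name, pct in percents.items()
        if pct >= min_percent_barcode
    }

# Made-up barcode calls: one spurious barcode at 0.1% of the reads.
barcodes = ["barcode01"] * 600 + ["barcode02"] * 399 + ["barcode07"] * 1
default_cut = filter_barcodes(barcodes)
strict_cut = filter_barcodes(barcodes, min_percent_barcode=0.5)
print(sorted(default_cut))
print(sorted(strict_cut))
```

Raising the threshold removes the spurious barcode07 call, while the default value keeps it because it sits exactly at 0.1%.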

In [49]:
help(pycoQC.barcode_counts)
Help on function barcode_counts in module pycoQC.pycoQC:

barcode_counts(self, min_percent_barcode=0.1, colors=['#f8bc9c', '#f6e9a1', '#f5f8f2', '#92d9f5', '#4f97ba'], width=None, height=600, sample=100000, plot_title='Percentage of reads per barcode')
    Plot a mean quality over time
    * min_percent_barcode: minimal percentage od total reads for a barcode to be reported
    * colors: List of colors (hex, rgb, rgba, hsl, hsv or any CSV named colors https://www.w3.org/TR/css-color-3/#svg-color
    * width: With of the ploting area in pixel
    * height: height of the ploting area in pixel
    * sample: If given, a n number of reads will be randomly selected instead of the entire dataset

In [28]:
# Run cell with Ctrl + Enter
p  = pycoQC ("./data/Albacore-1.2.3_basecall-1D-RNA_small_sequencing_summary.txt.gz")
fig = p.barcode_counts ()
iplot(fig, show_link=False)

Channels activity over time

Although a plot of the flowcell layout can be visually attractive (see https://github.com/mattloose/flowcellvis), it is not very informative about how the channels generate data during the run.

The channels_activity method generates a heatmap style plot showing the output over time per channel.

The number of channels can be changed with n_channels to match MinION flowcells (512, the default) or PromethION flowcells (3000).

The argument smooth_sigma can be used to modulate the smoothing factor of the Gaussian smoothing filter.

Colors can be changed with colorscale.
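
Conceptually, the heatmap is a matrix with one column per channel and one row per time window, filled with the number of bases produced. The sketch below builds such a matrix in plain Python. The function name, the number of time windows and the read records are hypothetical, and numbering channels from 1 is an assumption.

```python
def channel_activity(reads, n_channels=512, n_windows=4):
    """Return a matrix [time window][channel] of bases produced."""
    run_duration = max(start_time for _, start_time, _ in reads)
    window = run_duration / n_windows
    matrix = [[0] * n_channels for _ in range(n_windows)]
    for channel, start_time, length in reads:
        if window == 0:
            row = 0  # every read started at time zero
        else:
            # The last read would fall just past the last window: clamp it.
            row = min(int(start_time / window), n_windows - 1)
        # Assumption: channels are numbered from 1 in the summary file.
        matrix[row][channel - 1] += length
    return matrix

# Hypothetical reads: (channel, start time in seconds, read length).
reads = [(1, 0, 100), (1, 10, 200), (3, 95, 50), (4, 100, 500)]
matrix = channel_activity(reads, n_channels=4, n_windows=2)
for row in matrix:
    print(row)
```

Channels that never produce a read (channel 2 here) stay at zero in every row, which is what makes inactive channels stand out in the heatmap.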

In [50]:
help(pycoQC.channels_activity)
Help on function channels_activity in module pycoQC.pycoQC:

channels_activity(self, colorscale=[[0.0, 'rgba(255,255,255,0)'], [0.01, 'rgb(255,255,200)'], [0.25, 'rgb(255,200,0)'], [0.5, 'rgb(200,0,0)'], [0.75, 'rgb(120,0,0)'], [1.0, 'rgb(0,0,0)']], n_channels=512, smooth_sigma=1, width=None, height=600, sample=100000, plot_title='Output per channel over experiment time')
    Plot a yield over time
    * colorscale: a valid plotly color scale https://plot.ly/python/colorscales/ (Not recommanded to change)
    * n_channels: Overall number of expected channels (512 for Minion, 3000 for Promethion)
    * smooth_sigma: sigma parameter for the Gaussian filter line smoothing
    * width: With of the ploting area in pixel
    * height: height of the ploting area in pixel
    * sample: If given, a n number of reads will be randomly selected instead of the entire dataset

In [30]:
# Run cell with Ctrl + Enter
p  = pycoQC ("./data/Albacore-1.2.1_basecall-1D-DNA_small_sequencing_summary.txt.gz")
fig = p.channels_activity ()
iplot(fig, show_link=False)